Optimal Scheduling in General Multi-Queue System by Combining Simulation and Neural Network Techniques

The problem of optimal scheduling in a system with parallel queues and a single server has been extensively studied in queueing theory. However, such systems have mostly been analysed by assuming homogeneous attributes of arrival and service processes, or Markov queueing models were usually assumed in heterogeneous cases. The calculation of the optimal scheduling policy in such a queueing system with switching costs and arbitrary inter-arrival and service time distributions is not a trivial task. In this paper, we propose to combine simulation and neural network techniques to solve this problem. The scheduling in this system is performed by means of a neural network informing the controller at a service completion epoch on a queue index which has to be serviced next. We adapt the simulated annealing algorithm to optimize the weights and the biases of the multi-layer neural network initially trained on some arbitrary heuristic control policy with the aim to minimize the average cost function which in turn can be calculated only via simulation. To verify the quality of the obtained optimal solutions, the optimal scheduling policy was calculated by solving a Markov decision problem formulated for the corresponding Markovian counterpart. The results of numerical analysis show the effectiveness of this approach to find the optimal deterministic control policy for the routing, scheduling or resource allocation in general queueing systems. Moreover, a comparison of the results obtained for different distributions illustrates statistical insensitivity of the optimal scheduling policy to the shape of inter-arrival and service time distributions for the same first moments.


Introduction
Machine learning algorithms have been used over the last ten years in almost all fields where problems associated with data classification, pattern recognition, non-linear regression, etc., have to be solved. The application of such algorithms has also intensified in the field of queueing theory. While the first steps in the successful application of machine learning to evaluate the performance characteristics of simple and complex queueing systems have already been taken, the total number of works on this topic still remains modest. As for reviews, we can only refer to a recent paper by Vishnevsky and Gorbunova [1] which proposes a systematic introduction to the use of machine learning in the study of queueing systems and networks. Before we formulate our specific problem we would like also to make a small contribution to the popularisation of machine learning in the queueing theory by describing briefly the latest works. In Stintzing and Norrman [2], an artificial neural network was used for predicting the number of busy servers in the M/M/s queueing system. The papers of Nii et al. [3] and Sherzer et al. [4] have answered positively the question regarding whether the machines could be useful for solving the problems in general queueing systems. They employed a neural network approach to estimate the mean performance measures of the multi-server queues GI/G/s based on the first two moments of the inter-arrival and service time distributions. A machine learning approach was used in the work of Kyritsis and Deriaz [5] to predict the waiting time in queueing scenarios. The combination of a simulation and machine learning techniques for assessing the performance characteristics was illustrated in Vishnevsky et al. [6] on a queueing system MMAP/PH/M/N with K priority classes. Markovian queues were simulated using artificial neural networks in Sivakami et al. [7]. Neural networks were used also in research by Efrosinin and Stepanova [8] to estimate the optimal threshold policy in a heterogeneous M/M/K queueing system. The combination of the Markov decision problem and neural networks for the heterogeneous queueing model with process sharing was studied by Efrosinin et al. [9]. The performance parameters of the closed queueing network by means of a neural network were evaluated in Gorbunova and Vishnevsky [10]. In addition to the presented results of using neural networks in hypothetical queueing theory models, academic studies in this area with real-world applications have gradually been proposed. For example, the problem regarding the choice of an optimum chargingdischarging schedule for electric vehicles with the usage of a neural network is proposed by Aljafari et al. [11]. The main conclusion to be drawn from the previous results obtained via the application of machine learning to models of the queueing theory is that the neural networks cannot be treated as a replacement for classical methods in system performance analysis, but rather as a complement to the capabilities of such an analysis.
The systems with parallel queues and one server are known also as polling systems which have found wide application in various fields such as computer networks, telecommunications systems, control in manufacturing and road traffic. For analytic and numerical results in various types of polling systems with applications to broadband wireless Wi-Fi and Wi-MAX networks, we refer interested readers to the textbook by Vishnevsky and Semenova [12] and the references therein. The same authors in [13] developed their research on polling systems to systems with correlated arrival flows such as MAP, BMAP, and the group Poisson arrivals. In Vishnevskiy et al. [14], it was shown that the results obtained by a neural network are close enough to the results of analytical or simulation calculations for the M/M/1 and MAP/M/1-type polling systems with cyclic polling. Markovian versions of a single-server model with parallel queues have been investigated by a number of authors. The two-queue homogeneous model with equal service rates and holding costs was studied in Horfi and Ross [15], where it was shown that the queues must be serviced exhaustively according to the optimal policy. In research by Liu et al. [16], it was shown that the scheduling policy that routes the server with respect to the LQF (Longest Queue First) policy is optimal when all queue lengths are known and that the cyclic scheduling policy is optimal in cases where the only information available is the previous decisions. The systems with multiple heterogeneous queues in different settings, also known as asymmetric polling systems, have been studied intensively in cases where there are no switching costs by Buyukkoc et al. [17], Cox and Smith [18], where the optimality of the static cµ-rule was proved. This policy schedules a server first to the queue i with a maximum weight c i µ i consisting of the holding cost and service rate. In Koole [19], the problem of optimal control in a two-queue system was analysed by means of the continuous-time Markov decision process and dynamic programming approach. The author found numerically that the optimal policy which minimizes the average cost per unit of time can be quite complex if there are both holding and switching costs. The threshold-based policy for such a queueing system was applied by Avram and Gómez-Corral [20], where the expressions for the long-run expected average cost of holding units and switching actions of the server were given. The queueing system with general service times and set-up costs which have an effect on the instantaneous switch from one queue to another was studied in Duenyas and Van Oyen [21]. The authors proposed a simple heuristic scheduling policy for the system with multiple queues. A rather similar model is described in Matsumoto [22], where the optimal scheduling problem is solved in a system with arbitrary time distributions. Here, instead of switching costs, the corresponding set-up time intervals required for switching are used. The system is controlled by the Learning Vector Quantization (LVQ) network, see Kohonen [23] for details, which classifies the system state by the closest codebook vector of a certain class in terms of the Euclidean metric. The problem with this approach is the large number of parameters associated with the codebook vectors, where it is normally required that several vectors per class must be estimated for a given control policy using computationally expensive recurrent algorithms.
This paper proposes a fairly universal method for solving the problem of optimal dynamic scheduling or allocation in queueing systems of the general type, i.e., where the times between events are arbitrarily distributed, and in queueing systems with correlated inter-arrival and service times. Furthermore, it can provide a performance analysis of complex controlled systems described by multidimensional random processes, for which finding analytical, approximate or heuristic solutions is a difficult task. The main idea of the paper is to use a multi-layer neural network for server scheduling. The parameters of this neural network trained first on some arbitrary control policy are optimized then with the aim to minimize a specified average cost function. Moreover, such a cost function for systems with arbitrary inter-arrival and service time distributions can only be computed via simulation. We consider this approach, which combines neural networks with simulation technique, to be quite universal to obtain an optimal deterministic control policy in complicated queueing systems. The method is exemplified by some version of a single-server system with parallel queues equipped with a controller for scheduling a server. The system under study is assumed to have heterogeneous arrival and service attributes, i.e., unequal arrival and service rates, as well as holding and switching costs. Systems with arbitrary distributions and switching costs have not yet been considered by other authors. It is assumed in our model that the queue currently being served by the server is serviced exhaustively. The next queue to be served by the server is selected according to a dynamic scheduling policy based on the queue state information, i.e., on the number of customers waiting in each of parallel queues. It is expected that the changing of the serviced queue involves the switching costs. The holding of a customer in the system is also linked to the corresponding cost. Clearly, even with some fixed scheduling control policy, calculating any characteristics of the proposed queueing system with arbitrary inter-arrival and service time distributions in explicit form is not a trivial task. It is also difficult to fix the dynamic control policy defining the scheduling in large systems in a standard way, e.g., through a control matrix that would contain the corresponding control action for all possible states of the system. Therefore, in such a case we consider it justified to solve the problem of finding the optimal scheduling policy with the aim to minimize the average cost per unit of time by combining the simulation as a tool to calculate the performance characteristics of the system with a machine learning technique, where the neural network will be responsible for dynamic control. By training a neural network for some initial control policy, we obtain characteristics of the network in the form of a matrix of weights and a vector of biases. The process of solving the optimal scheduling problem is then reduced to a discrete parametric optimization. The parameters of the neural network must be optimized in such a way that this network can guarantee the minimal values of the average cost functional by generating control actions at decision epochs. For this purpose, we have chosen one of the random search methods, such as simulated annealing, see, e.g., in Aarts and Korst [24], Ahmed [25]. It is a heuristic method based on a concept of heating and controlled cooling in metallurgy and is normally used for global optimization problems in a large search space without any assumption on the form of the objective function. This algorithm was implemented by Gallo and Capozzi [26] specifically for the probabilistic scheduling problem. The algorithm will be adapted for a non-explicitly defined parametric function with a large number of variables defined on a discrete domain.
To verify the quality of the calculated optimal parameters of the neural network, the values of the average cost functional for the markovian version of the queueing system are compared with the results obtained by solving the Markov decision problem (MDP). The general theory on MDP models is discussed in Puterman [27] and Tijms [28]. The details on application of MDP to controlled queueing systems with heterogeneous servers can be found in Efrosinin [29]. The optimal control policy and the corresponding objective function are calculated in the paper via a policy-iteration algorithm proposed in Howard [30] for an arbitrary finite-state Markov decision process. According to the MDP, the router in our system has to find an optimal control action in the state visited at a decision epoch with the aim to minimize the long-run average cost. Note that for our queueing model under general assumptions the semi-Markov decision problem (SMDP) can be formulated. The SMDP is a more powerful model than the MDP since the time spent by the system in each state before a transition is taken into account by calculating the objective function. The objective function must be calculated here also by means of a simulation. In this case, the reinforcement learning algorithm, e.g., Q-P-Learning, can be applied. The main problem of this approach consists of the fact that many pairs of state and action can remain non-observable for deterministic control policy and as a result the control actions in such states can not be optimized. However, in our opinion, neural networks can also be used to solve this problem which presents a potential task for further research. The SMDP topic is outside the scope of this article but we refer readers to work by Gosavi [31], where one can find a very interesting overview on reinforcement learning and a well-designed classification of simulated-based optimization algorithms.
Summarising our research in this paper we can highlight the following main contributions: (a) We propose a new controlled single-server system with parallel queues where the router uses a trained multi-level neural network to perform a scheduling control: (b) A simulated annealing method is adapted to optimize the weights and biases of the neural network with the aim to minimize the average cost function which can be calculated only via simulation; (c) The quality of the resulting optimal scheduling policy is verified solving a Markov decision problem for the Markovian analog of the queueing system; (d) We provide detailed numerical analysis of the optimal scheduling policy and discuss its sensitivity to the shape of the inter-arrival and service time distributions; (e) The distinctive feature of our paper is the presence of algorithms employed in the form of pseudocodes with detailed descriptions of relevant steps.
The rest of the paper is organized as follows. Section 2 presents a formal description of the queueing system and optimization problem. Section 3 describes the Markov decision problem and the policy-iteration algorithm used to calculate optimal scheduling policy. In Section 4, the event-based simulation procedure of the proposed queueing system is discussed. The neural network architecture, parametrization and training algorithm are summarized in Section 5. Section 6 presents simulated annealing optimization algorithm. The numerical analysis is shown in Section 7 and concluding remarks are provided in Section 8.

Single-Server System with Parallel Queues
Consider a single-server system with N parallel heterogeneous queues of the type GI/G/1 and router for scheduling the server across the queues. Heterogeneity here refers to unequal distributions associated with inter-arrival and service times of customers in different queues, as well as unequal holding and switching costs. The queue that is currently being serviced is exhaustively serviced. Denote I = {1, 2, . . . , N} as a queue index set. The proposed queueing system is shown schematically in Figure 1. Denote τ n,i , n ≥ 1 as the time instants of arrivals to queue i and ν i := ν n,i = τ n,i − τ n−1,i , n ≥ 1 as the sequence of mutually independent and identically distributed interarrival times with a CDF A i (t), i ∈ I. Further denote by ζ i := ζ n,i , n ≥ 1, the service time of the nth customer in the ith queue. These random variables are also assumed to be mutually independent and generally distributed with CDF B i (t), i ∈ I. We assume that the random variables ν i and ζ i have at least two first finite moments The squared coefficients of variation are defined then, respectively, as This characteristic will be required to provide a comparison analysis of the optimal scheduling policy for different types of inter-arrival and service time distributions. From now it is assumed that the ergodicity condition is fulfilled, i.e., the traffic load Let D(t) indicate the sequence number of the queue currently being serviced by the server at time t, and Q i (t) denote the number of customers in the ith queue at time t, where i ∈ I. The states of the system at time t are then given by a multidimensional random process with a state space Further in this section, the notations d(x) and q i (x) will be used to identify the corresponding components of the vector state x ∈ E. The cost structure consists of the holding cost c i per unit of time the customer spends in queue i and the switching cost c i,j to switch the server from queue i to queue j. It is assumed that the system states X(t) are constantly monitored by the router which defines the queue index to be serviced next after a current queue becomes empty. In initial state, when the total system is empty, a server is randomly scheduled to some queue. If the ith queue to be served becomes empty, such a moment we call a decision epoch, the router makes a decision by means of the trained neural network whether it must leave the server at the current queue or dispatch it to another queue. The routing to an idle queue is also possible. We remind that the server allocated by the router to a certain queue serves it exhaustively, i.e., it is only possible to change the queue if it becomes empty. Denote by A = I an action space with elements a ∈ A, where a indicates the queue index to be served next after the current queue has been emptied. The subsets A(x) of control actions in state coincide with the action space A. In all other states x from E \Ê the subsets A(x) = {0} includes only a fictitious control action 0 which has no influence on the system's behavior. The router can operate according to some heuristic control policies. It could be for example a Longest Queue First (LQF) policy which is a dynamic one and it prescribes at decision epochs to serve the next queue with the highest number of customers. If there are more than one queue with the same maximal number of customers, the queue number is selected randomly. Alternatively, the static cµ-rule, which needs only the information if a certain queue in non-empty, can be used for scheduling. According to this control policy the queue i with the highest factor c i µ i which is the product of the holding cost and the service intensity, must be serviced next. In the system with totally symmetric queues the former policy is according to [16] optimal. The latter control policy is optimal due to [17] if there is no switching costs, i.e., c i,j = 0. Otherwise, in case of positive switching costs and asymmetric or heterogeneous queues such policies are not optimal with respect to minimization of the average cost per unit of time.
The main idea of an optimal scheduling in our general model is as follows. We will equip the router with a trained neural network which will inform it on the index number of the next queue to which the server should be routed with the aim to reach formulated optimization aims. Obviously, we can only train the neural network on available data sets, i.e., on some heuristic control policy, and then we will need to optimize the network parameters such as the weights and the biases to solve the problem of finding the optimal scheduling policy. In the average cost criterion the limit of the expected average cost over finite time intervals is minimized in a set of admissible policies. The control policy f :Ê → A(x) is a stationary policy which prescribes the usage of a control action f (x) ∈ A(x) whenever at a decision epoch the system state is x ∈ E. The decision epochs arise whenever after serving any queue that queue becomes empty. For studied controllable queueing system operating under a control policy f , the average cost per unit of time for the ergodic system is of the form where S i,j (t) is the random number of switches from queue i to queue j in time interval [0, t]. Expectation E f must be calculated with respect to the control policy f . The policy f * is said to be optimal when for any admissible policy f , Our purpose focuses on a combination of simulation and neural network techniques. To verify the quality of results obtained by solving the optimization problem (4) we formulate an appropriate Markov decision problem. Then we compute the optimal control policy together with the corresponding average cost g * using a policy iteration algorithm, see, e.g., in Howard [30], Puterman [27], Tijms [28], which will be discussed in detail in a subsequent section.

Markov Decision Problem Formulation
Assume that the inter-arrival and service times are exponentially distributed, i.e., ν i ∼ E (λ i ) and ζ i ∼ E (µ i ), i ∈ I. Under Markovian assumption the process (1) is a continuous-time Markov chain with a state space E. The MDP associated with this Markov process is represented as a five-tuple: where state space E, action spaces A and A(x) have been already defined in the previous section.
λ xy is a transition rate to go from state x to state y by choosing a control action a is defined as where is an immediate cost in state x ∈ E by selecting an action a, Here the first summand denotes the total holding cost of customers in all parallel queues in state x which is independent of a control action. Let c(x) = ∑ N i=1 c i q i (x) and if c i = 1, i ∈ I, we get the number of customers in state x. The second summand includes the fixed cost c j,a for switching the server from the current queue j to the next queue with an index a.
The optimal control policy f * and the corresponding average cost g f * are the solutions of the system of Bellman optimality equations, where B is a dynamic programming operator acting on value function v : E → R.

Proposition 1. The dynamic programming operator B is defined as
Proof. From the Markov decision theory, e.g., [27,28], it is known that for continuous time Markov chain the operator B can be defined as Bv(x) = min a c(x, a) + ∑ y =x λ xy v(y) . This equality for the proposed system can be obviously rewritten in form (8). In this equation, the first term c(x) represents the immediate holding cost of customers in state x. The second term by λ i describes the changes in value function due to new arrivals to the system. The third term by µ j for q j (x) > 1 stands for the value function by service completion in the queue j where there are customers waiting for service. The last term by µ j for q j (x) = 1 describes also a service completion which leads now to the state with an empty queue when a control action must be performed. Hence only the last term occurs with a min operator.
Note that the state space of the Markov decision model is countable infinite and the immediate costs c(x, a) are unbounded. The existence of the optimal stationary policy and convergence of the policy iteration algorithm can be verified for the system under study in a similar way as in Özkan and Kharoufeh [32], where first, the convergence of the value iteration algorithm for the equivalent discounted model is proved, and then, using the criteria proposed in Sennott [33], this result is extended to the policy iteration algorithm for the average cost criterion.
To solve Equation (8) in the policy iteration algorithm required to calculate the optimal control policy, we convert the multidimensional state space into a one-dimensional space by mapping ∆ : E → N 0 . The buffer sizes of the queues must be obviously truncated, namely B i < ∞. Thereby the state x = (d, q 1 , . . . , q N ) can be rewritten in the following form: where β i,j = ∏ j k=i (B k + 1) with β N,N−1 = 1. The notation ∆ −1 (s) will be used for the inverse function. In one-dimensional case the state transitions can be expressed as The set of states E in truncated model is finite with a cardinality |E| = Nβ 1,N . The policy iteration Algorithm 1 consists of two main steps: Policy evaluation and policy improvement. In first step for the given initial control policy, it can be for example the LQF policy, the system of linear equations with constant coefficients must be solved. To make the system solvable the value function v(s) for one of the states can be assumed to be an arbitrary constant, e.g., v(0) = 0 in the first state with d = 1 and q i = 0. In this case we obtain from the optimality Equation (7) . The remaining equations can be solved numerically. As a solution we get the |E| values v(s) and the current value of the average cost g. In the policy improvement step, a control action a that minimizes the test value in the right-hand side of Equation (7) must be evaluated. The algorithm generates a sequence of control policies that converges to the optimum one. The convergence of the algorithm requires that the control actions in two adjacent iterations coincide in each state. To avoid policy improvement bouncing between equally good control actions in a given state, one can simply keep the previous control action unchanged if the corresponding test function is at least as large as for any other policy in determining the new policy. As an alternative to the proposed convergence criterion, one can use the values of average costs the variation of which should be for example less than a given some small value. Example 1. Consider the queueing system with N = 4 queues. The buffer sizes are equal to B i = 10, i ∈ I. At these settings the number of states already reaches large values, |E| = 58, 564, which confirms one of significant restrictions on application of dynamic programming for this type of control problems. The switching costs can be defined for example as c i,j = j − i + 4 mod 4. The holding costs c i for simplicity are assumed to be equal. The values of system parameters λ i , µ i , c i and c i,j are summarized in Table 1 and reflect heterogeneity of the system parameters, i.e., λ i = 0.05i and µ i = 3.750 i .
Initial policy end for 8: 10: else n ← n + 1, go to step 4 11: end if 12: end procedure These values correspond to the system load ρ = ∑ N i=1 ρ i = 0.4, that is the system is stable. This value is enough small to ensure on the one hand that the system is sufficiently loaded so that states appear where all queues are not empty, and on the other hand to minimize the probability of losing an arriving customer for given rather small buffer sizes. The solution of the large system of optimality equations is carried out numerically. The optimized average cost is g * = 2.5632.
Using Algorithm 1, we calculate the optimal scheduling policy. For some of states with fixed number of customers in the third and the fourth queues and varied number of customers in the first two queues the control actions are listed in Table 2. The first row of the table contains the values of the number of customers q 2 and q 1 in the second or first queue when a decision is made, respectively, when the first or second queue is emptied. The first column contains some selected states of the system for the fixed levels q 3 and q 4 of the third and fourth queues. As we can see, the optimal scheduling policy has a complex structure with a large number of thresholds, making it difficult to obtain any acceptable heuristic solution explicitly. To better visualise the complexity in structure of the optimal control policy, the background of the table cells changes in grey colour from darker to lighter backgrounds as the queue index decreases. The cµ-rule as expected is not optimal here, g cµ = 6.7237 that is almost two and a half times more than the value of the average cost under the optimal policy. When the values q 1 and q 2 are small, the router schedules the server to serve the queues with low service rates. In this case the switching costs are low as well. According to the optimal scheduling policy the initiative to route a server to the queue with a higher service rate and switching costs increases as the length of the first two queues increases. (2, q 1 , 0, 5, 5) 3  3  3  3  3  3  3  3  3  3  3 (2, q 1 , 0, 8, 8)  3  3  3  3  3  3  3  3  3  3  3 (2, q 1 , 0, 9, 9) 3  3  3  3  3  3  3  3  1  1  1 Example 2. In this example we increase the arrival rates λ i as given in Table 3. The other parameters are fixed at the same values as in the previous example. The load factor now is ρ = 0.64, and the corresponding optimized average cost is g * = 3.8201 and g cµ = 7.0420.  The Table 4 of scheduling policy shows that as the system load increases the router switches the server to queue 2 or to queue 1 with a higher service rates at almost all queue lengths q 1 and q 2 , respectively.

Event-Based Simulation for General Model
We use an event-based simulation to simulate the proposed queueing system. This technique is suitable for random process evaluation where it is sufficient to have the information about the time instants when changes in states occur. Such changes will be referred to as events. Note that although simulation modelling is extensively used in queueing theory, many papers lack explicitly described algorithms that readers can use for independent research. For more information on simulation methods with applications to single-and multi-server queueing systems, we can recommend Ebert et al. [34] and Franzl [35]. In this regard, it will certainly not be superfluous if we present and discuss here an algorithm for the system simulation which is not difficult to adapt for other similar systems.
In our case, the events are the arrivals to one of N parallel queues and the departures of customers from the queue d currently being served by the server. The present time is selected as a global time reference.
In Figure 2, on the time axis we mark the moments of arrival of new customers and the moments of their service in a fixed queue with index d by means of arrows above and below the axis, respectively. The dotted arrows indicate the arrival of new customers in other queues. The successive events are denoted by ε i and the corresponding time moments by t(ε i ). In the proposed queue simulation Algorithm 2 all the times are referred to the present time. Suppose that at the present moment of time there is a new arrival to the queue with the number d, which is serviced by the server, i.e., t(ε i ) = 0. Denote by T x (ε i ) the holding time of the system in state x up to the occurrence of the event ε i . According to the time schema the holding time in a previous state is defined as remaining inter-arrival time to the queue d, T b (d) stands for the generated service time after the event ε i−2 of the previously occurred departure and the dots replace the time intervals associated with arrivals of customers in other queues. The next event is determined then by subtracting the holding time t i from the all event time intervals. In this case the current event is a new arrival. Thus, the holding time t i+1 in state up to the event ε i+1 of an arrival to some other queue which not equal to d is calculated by The subsequent holding times are calculated as follows, j=i+1 T x (ε j ) is a remaining inter-arrival time for the next arrival to the queue d. Continuing the process in a similar manner, all holding times of the system in the corresponding states are evaluated. By summing up the times t i we obtain the total simulation running time of the system simT. The average cost per unit of time is then obtained by division of the accumulated cost by the time symT.

Neural Network Architecture
In our model, we propose to equip the router with a trained neural network. This network will determine an index of the queue that the server will serve next based on the information about the system state at a decision epoch when the server finishes service of the current queue. We have chosen a simple architecture for the neural network consisting of only two layers in such a way that, on the one hand, it would have a small number of parameters for further optimization and, on the other hand, that the quality of correct classification of some fixed initial control policy would be equal to at least 95%. The proposed neural network has one linear layer which represents an affine transformation and softmax normalization layer as illustrated in Figure 3. The input includes N + 1 neurons according to the system state x = (d, q 1 , . . . , q N ), where q d (x) = 0. The neuron 0 gets the information on d(x), the ith neuron for i ∈ I gets the information on the state of ith queue. When the server finishes service at queue d, then the neural network classifies this state to one of N classes which defines a current control action a ∈ A in state x. The hidden linear layer consists of N neurons y = (y 1 , . . . , y N ) which are connected with an input neurons via the system of linear equations with w i = (w i,0 , w i,1 , . . . , w i,N ) are, respectively, the matrix of weights and the vector of biases of the given neural network which must be estimated by means of the training set. The softmax layer z = softmax(y) is a final layer of the multiclass classification. The softmax layer generates as an output the vector of N estimated probabilities of the input sample y i , where the ith entry is the likelihood that x belongs to class i. The vector y is normalized by the transformation The class number is then defined asâ = arg max z i . Hence, the output z is a mapping of the form z = ϕ(x, θ), where θ ∈ R N(N+2) is the parameter vector of the neural network which includes all entries of the weight matrix W ∈ R N×(N+1) and the bias vector B ∈ R N , i.e., θ = (w 1 , w 2 , . . . , w N , B ).
The values of the parameter vector θ of the initial control policy, which in the next section will be used as a starting solution for optimization procedure, are obtained by training the neural network on some known heuristic control policy. In our case this policy is the LQF.
In the training phase the following optimization problem must be solved given the training where a non-negative loss function , θ] takes the value 0 only if the class of the kth element of a sample is i, i.e.,â = a (k) . The problem (12) can be solved in a usual way by the stochastic gradient descent method, where a single learning rate η to update all parameters is maintained. The corresponding iterative expression is given below, where ∇ θ is a Nabla-operator defining the gradient of the function relative to the parameter vector θ. In our calculations we use the adaptive moment estimation algorithm (ADAM) to solve the problem (12). It updates iteratively the parameters of the neural network based on training data. The ADAM calculates independent adaptive learning rates for the elements of θ by evaluating the first-moment and second moment estimation of the gradient. The method is simple to implement, computationally efficient, requires little memory and is invariant to diagonal changes in gradients. The further detailed information regarding ADAM algorithm can be found in Kingma and Ba [36]. Despite the fact that the ADAM algorithm can be found across various sources, we have also chosen to cite it in this article. The main steps required for the iterative updating the parameter vector θ are summarized in the Algorithm 3. The parameters of the Algorithm 3 are fixed to η = 0.001, β 1 = 0.9, β 2 = 0.999, ε = 10 −8 and δ = 0.001. The classification accuracy of the proposed neural network trained on the LQF policy is over 97%. The test phases of the trained network were conducted on system states with a queue length of up to 100 customers per queue. Thus, this starting network can be used to generate control actions of the initial control policy for subsequent parameters' optimization of this neural network.
Calculate the gradient at step n 9: Update the biased first moment 10: Update the biased second moment 11: The bias-corrected first moment 12: The bias-corrected second moment 13: Update the parameter vector 14: if |θ (n) − θ (n−1) | < δ then return θ (n)

Optimization of the Neural-Network-Based Scheduling Policy
Denote by θ the known parameter vector of the trained neural network as was defined in (11). The function g(θ) means the average cost for the queueing system where the router chooses an action obtained from the trained neural network with the parameter vector θ. We adapt further a simulated annealing method described in Algorithm 4 for discrete stochastic optimization of the average cost function with a multidimensional parameter vector θ. This algorithm is quite straightforward. It needs some starting solution and in each iteration the algorithm evaluates for the randomly selected neighbor values of the function parameters the corresponding function value. If the neighbor occurs to be better than the current solution with respect to value of the objective function, algorithms replaces the current solution with a new one. If the neighbor value is worse, the algorithm keeps the current solution with a high probability and chooses a new value with a specified low probability. The simulated annealing requires the finite discrete space for the parameters of the optimized function. It is assumed that all weights and biases of the neural network summarized in the vector θ take values in the interval [θ min , θ max ] with a low bound θ min and an upper bound θ max . Moreover, this interval is quantized in such a way that θ i , i = 1, . . . , N(N + 2), takes only discrete values θ min + k∆, k = 0, 1, . . . , Q, where Q = θ max −θ min ∆ is a quantization level. Note that the domains for the elements of the parameter vector θ can be specified separately, and the values of the vector obtained by training the neural network based on the optimal policy of the Markov model will be suitable for determining the possible maximum and minimum bounds. In this case it is possible to achieve faster convergence of Algorithm 4 to the optimal value. g * ←ḡ(θ (n) ), θ * ← θ (n)

18:
else θ (n) ← θ (n−1) , m ← m + 1 19: end if 20: end while 21: end procedure Since the average cost function g can not be calculated analytically, for this purpose a simulation technique is used. As shown in Algorithm 4, at each iteration at the step where the current solution can be accepted with a given probability we need to calculate the difference between the object functions. Due to the fact that this function can only be calculated numerically, it is necessary to check whether this difference is statistically significant at each iteration of the algorithm. The algorithm is modified in such a way that the t-test for two samples is used to compare the expected values of two normally distributed samples with unknown but equal variances. Denote by θ 1 and θ 2 , respectively, the current and the modified parameter vector and bȳ two corresponding first empirical moments of the objective function. According to the t-test the null hypothesis which states that for the modified vector the average cost is statistically smaller then the previous solution is rejected if where t m;q stands for the q-quantile of the t-distribution and statistics S g(θ 1 ),g(θ 2 ) is defined as with empirical variances V (m) g(θ 1 ) and V (m) g(θ 2 ) . Below, we briefly describe the main steps of the Algorithm 4. At the initialisation step of the algorithm, the neural network is trained based on the LQF control policy. The parameter vector is then equal to the initial vector θ (0) to be optimized. The simulation Algorithm 2 is then used to calculate the initial sample {g (k) (θ (0) )} m k=1 with g (k) (θ (0) ) = QSIM(. . . ) of the average cost function for a given initial parameter vector θ (0) and the corresponding first empirical momentḡ(θ (0) ). These values are set as the current solution g * and θ * to the optimization problem (13). At the perturbation step, a randomly chosen element of the previous parameter vector θ (n−1) must be randomly perturbed on the specified set of average costs must be calculated together with the first empirical momentḡ(θ (n) ). At the acceptance step, a new policy θ (n) can be accepted as a current solution with a probability p defined as where T(n) is the temperature at the nth iteration. If a new policy θ (n) is accepted, then it is defined together with a corresponding average costḡ(θ (n) ) as a current solution. Otherwise, the last change in the parameter vector θ (n−1) must be reversed, i.e., θ (n) = θ (n−1) and the sample size m for calculating the first moments is updated. Then the perturbation step must be repeated. For termination of the algorithm the stopping criteria T(n) < τ or n < ν is used. We note that the classical simulated annealing method generates for some function g(θ) a sample θ (n) which for the constant temperature T(n) = T can be interpreted as a realization of a homogeneous Markov chain {Θ n } { n ∈ N 0 } with transition probabilities where U n is a uniformly distributed random variable on the interval [0, 1]. It is easy to show that the modified transition probabilities, where the objective function is calculated numerically, converges to the transition probabilities (17) which in turn can guarantee the convergence to an optimal solution. Proposition 2. The acceptance probability p(n) satisfies the limit relation Proof. The probability P[U n ≤ X] can be obviously rewritten as . Then the following relation holds, due to the strong law of large numbers and the fact that for n → ∞ the sample size m → ∞ and hence

Numerical Analysis
Consider the queueing system with N = 4. We first analyse a Markov model, where the parallel queues are of the type M/M/1 with ν i ∼ E (λ i ) and ζ i ∼ E (µ i ), i ∈ I, the coefficient of variation CV 2 ν i = CV 2 ζ i = 1. The values of system parameters λ i and µ i are fixed as in examples 1 and 2 which will refer to as Cases 1 and 2. We compare the optimization results obtained by combining the simulation, neural network and simulated annealing algorithm with the results evaluated by the policy iteration algorithm. In Cases 1 and 2, the weights and the biases of the neural network trained on the calculated by PIA optimal scheduling policy take, respectively, the following values On the basis of these values, we can estimate in the simulation annealing Algorithm 4 the domain or solution space for each element of the vector θ. For simplicity, in our experiments we set common boundaries for all elements as θ min = −6 and θ max = 6. The length of the increment ∆ = 0.1 implies the quantization level Q = 120. Next, we set η = 6, ν = 200, and T(n) = 0.2 log(n) . As an initial vector θ (0) we take the parameter vector obtained by training the neural network on the LQF policy. For the initial control policy, one could also choose the policy W PIA , B PIA obtained by Algorithm 1. However, we would like to check the convergence of the algorithm when choosing not the best initial solution, since in general case one usually chooses either some heuristic policy or an arbitrary one. The empiric average costḡ(θ (n) ) for each iteration step is calculated based on sample with a size m ≥ 20. The accumulation of sample data in QSIM Algorithm 2 is carried out after 1000 customers have entered the system and is completed after 5000 customers have entered the system. We see that the elements of matrices W PIA and W SA are different, but they are markedly similar in terms of the elements with dominant values. The optimization process of the scheduling policy is illustrated in Figure 4. In addition to values of the average cost function obtained at each iteration step of the simulated annealing algorithm, the figures show horizontal dotted and dash-dotted lines, respectively, at level of the average cost g LQF = 9.7093 and g cµ = 4.1984 in figure labelled by (a) and g LQF = 11.1740 and g cµ = 5.2546 in figure labelled by (b) for the LQF and cµ heuristic policies. As expected, a non-optimal control policy LQF implies too high average cost. The results look much better for policy cµ, but still the presence of switching costs significantly worsens the performance of this policy. The red horizontal line indicates the average cost g PIA = 2.5632 and g PIA = 3.5500 obtained by solving the Markov decision problem using the policy iteration Algorithm 1. We can observe that the values are quite close to those obtained by random search. However, some small difference may be due, firstly, to the fact that the simulation is used for calculations and the results have a certain scattering, and, secondly, we do not exclude the influence of boundary states in the Markov model, where a buffer size truncation has been used. Testing the hypothesis for the difference between the optimal average costs g * and g PIA at least for our model showed the values to be statistically equivalent. In the figures, we have also marked with triangles those iteration steps with accepted policy (AP) where the perturbed parameter vector has been accepted. The number of accepted points in Case 1 and 2 is equal, respectively, to 98 and 110. From above results in case of exponential time distributions we can make the following observations. If the parameter vector θ (0) with elements W PIA and B PIA is used for the initial scheduling policy, then one can expect the faster convergence of the simulated annealing algorithm to the optimal solution which was confirmed numerically. If an optimal policy for a controlled Markov process is not available, e.g., when the number of queues is too large, in this case it is reasonable to use the static cµ-rule as an initial policy.
The SA algorithm converges to the values g * = 1.6500 and g * = 2.0326, respectively, for Case 1 and 2 with the following optimal policies, Case 1: The average costs for heuristic policies take the values g LQF = 3.7333, g cµ = 2.8000, g PIA = 1.6500 and g LQF = 5.0133, g cµ = 3.9866, g PIA = 2.7373.  It is observed that the optimal policy obtained by the SA algorithm is quite close to those obtained by the PIA. Nevertheless, from experiment to experiment certain deviations in the value of the average costs may appear. Therefore it is of interest for us to check whether such differences are statistically significant.
Further we analyse how sensitive is the optimal policy obtained in exponential case by the SA algorithm to the shape of arrival and service time distributions. The following distributions will be used to calculate the optimal control policy in the non-exponential case: gamma G(α, β), log-normal LN (µ, σ) and Pareto P R(α, k) distributions , where two last options belong to a set of heavy tail distributions. The parameters of these distributions are chosen so that their first and second moments coincide. Moreover, the first moments are the same as for exponential distributions. The moments need to be represented as functions depending on the corresponding sample moments as in the method of moments used for parameter estimation. In the following experiments, the first moments of the inter-arrival and service times are fixed at values of Case 2, and the squared coefficient of variation is varied as CV 2 ν i = CV 2 ζ i = 0.5 and CV 2 ν i = CV 2 ζ i = 20. Denote by {Z (k) } m k=1 a sample random variable Z distributed according to the proposed distributions with two first sample momentsZ,Z 2 and squared empirical coefficient of variation CV 2 Z =Z 2 Z − 1. Then for the gamma distribution Z ∼ G(α, β) with a PDF the parameters α > 0 and β > 0 satisfy the relations, In case of the lognormal distribution Z ∼ LN (µ, σ) with a PDF the parameters µ ∈ R and σ > 0 are calculated by In case of a Pareto distribution Z ∼ P R(k, α) with a PDF x ≥ k 0 x < k the parameters k > 0 and α > 0 are calculated by relations Parameters of the proposed probability distributions are listed in Tables 5 and 6, respectively, for inter-arrival and service time distributions. The sensitivity of the optimal control policy to the shape of the distributions is tested by means of a two-sided t-test for samples with unknown but equal variances. Let g exp and g opt are the samples of the average cost values obtained for the optimal control policy in case of exponentially distributed times and for the system with proposed distributions for the inter-arrival and service times. These samples of size m are associated with the normally distributed random variables Z exp ∼ N (µ g exp , σ g exp ) and Z opt ∼ N (µ g opt , σ g opt ), where µ g exp , µ g opt ∈ R and σ g exp = σ g opt > 0. The test is defined then as where statistics S g opt ,g exp is calculated by (16). The results of tests in form of the p-value, the values of the average costsḡ exp andḡ opt together with their 95% confidence intervals are summarized in Tables 7 and 8 for the systems with different inter-arrival and service time distributions with smaller and greater levels of dispersion around the mean, d.h. for CV 2 ν i = CV 2 ζ i = 0.5 in Table 7 and CV 2 ν i = CV 2 ζ i = 20 in Table 8. Table cell contains two rows with the values for the average costsḡ exp andḡ opt together with confidence boundaries, and the third row has the p-value. From the numerical examples, it is observed that the shape of distributions expressed through a coefficient of variation has a high level of influence over the value of the average cost functionsḡ exp andḡ opt . In almost all cases, the average cost increases significantly when the coefficient of variation increases. Only in the case of the Pareto distribution for the interarrival and service times is the change in values not significant. However, an examination of the entries in the last two tables reveals that in all experiments the p-value exceeds the significance level of α = 0.05. Furthermore, it is worth noting that in most cases this exceeding is sufficient large. In this regard, the statistical test fails to reject null hypothesis at a given significance level, in other words, the average cost values are statistically equal and the corresponding optimal control policies are equivalent. Therefore, at least within the framework of the experiments conducted, we can state that the optimal scheduling policy is insensitive to the shape of the inter-arrival and service time distributions given that the first moments are equal. For practical purposes, in general queueing systems one can either apply the proposed optimization method, or use the control policy optimized for the equivalent exponential model as a suboptimal scheduling policy.

Conclusions
In this paper, we combined the queue simulation technique, neural network and simulated annealing optimization to calculate the optimal scheduling policy and optimized average cost function in a general single-server queueing system with multiple parallel queues. The proposed combination of tools is sufficiently versatile to solve discrete optimization problems that occur during resource allocation in complex queueing systems and networks. The numerical results subsequently demonstrate the effectiveness of the proposed approach. The obtained optimal scheduling policy outperforms the best available heuristic policy which is the cµ-rule by more than 45% on average. Nevertheless, a couple of important points must be stressed that can be considered when using the proposed method. In simulated annealing, the choice of initial control policy affects the speed of convergence to the optimal solution. Furthermore, it is required that the finite domain be defined for the solution. If the dimensionality of the state space allows, the initial control policy and the corresponding finite solution space can be obtained by the policy iteration algorithm implemented for the Markov model. The obtained optimal solution seems to be statistically insensitive to the form of inter-arrival and service time distributions where the first two moments are the same. Moreover, the optimal policy in exponential case can be treated as a suboptimal policy and the corresponding trained neural network can be used by routers in queueing systems with arbitrary distributions. In terms of future research, we see potential in developing and applying this method to other complex controlled queueing systems where the search for optimal routing, scheduling and resource allocation policies is required. The possibility to compose the reinforcement learning algorithms and neural networks to solve optimization problems in general controlled queueing models could also be considered as a further line of research.  Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The authors can be contacted to obtain data used in the study.